Fix ResourceWatcher Data Race and Redis Connection Leaks#1741

Open
chungeun-choi wants to merge 2 commits into OT-CONTAINER-KIT:main from chungeun-choi:fix/performance-optimization

Conversation

@chungeun-choi

Description

This PR significantly improves the operator's concurrent throughput and fixes internal blocking bottlenecks when orchestrating multiple RedisReplication resources.

The detailed changes include:

  1. Fix ResourceWatcher thread-safety: Replaced the value receiver with a pointer receiver (`w *ResourceWatcher`) and added a `sync.RWMutex` to protect the `watched` map against data races when `MAX_CONCURRENT_RECONCILES > 1`.
  2. Fix TCP connection leaks: Restructured `GetRedisNodesByRole` to wrap `configureRedisReplicationClient` and the `defer redisClient.Close()` call in an anonymous function, so stale connections are closed at the end of each loop iteration instead of accumulating until the function returns.
  3. Remove redundant topology calls: Refactored `redisreplication_controller.go` to merge `reconcileRedis` and `reconcileStatus` into a single `reconcileRedisAndStatus` function, yielding a ~50% reduction in TCP handshakes per reconcile step.

Fixes #ISSUE

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • Tests have been added/modified and all tests pass.
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

Additional Context

In a resource-constrained local environment (Docker Desktop) orchestrating 30 RedisReplication clusters simultaneously:

  • Before the fix: Throughput bottlenecked at ~3.75 successfully labeled masters per minute, with connection exhaustion and repeated TCP timeouts blocking workers.
  • After the fix: Throughput rose to ~13.12 masters per minute (~3.5x improvement) under the same local resource constraints, with no goroutine leaks or panic logs.

This patch dramatically improves the throughput of the Redis operator during
large-scale provisioning contexts when MAX_CONCURRENT_RECONCILES > 1.

1. Fix controllerutil ResourceWatcher concurrent safety (Data Race fix)
2. Wrap GetRedisNodesByRole defer logic in func to prevent TCP connection leak
3. Consolidate reconcileRedis and reconcileStatus to avoid redundant topology calls

Signed-off-by: chungeun-choi <cucuridas@gmail.com>
Significantly increase the default rate limits (QPS) for kube client to
prevent aggressive client-side throttling and delays during scale-out events.
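The QPS change described above can be sketched roughly as follows. This is an illustrative fragment, not the commit's exact code or values: `QPS` and `Burst` are real `rest.Config` fields in client-go (defaulting to 5 and 10 respectively), but the numbers chosen here and the controller-runtime setup around them are assumptions.

```go
// Sketch: raise client-side rate limits so a burst of reconciles
// during scale-out isn't throttled by the kube client itself.
cfg := ctrl.GetConfigOrDie()
cfg.QPS = 100  // client-go default is 5 requests/sec
cfg.Burst = 200 // client-go default is 10
mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
```

Client-side throttling shows up in operator logs as "Waited for ... due to client-side throttling" messages, which is the delay this commit targets.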

Signed-off-by: chungeun-choi <cucuridas@gmail.com>